Data, like the world, can seem chaotic. To ask questions of it, we have to transform the data into useful structures that both we and the computer can interact with.
In this workshop we will be using the tidyverse library, a collection of R packages that acts as an extra layer of interaction between base R and the user without a significant impact on performance. If you haven’t installed it, do so by copying the following line into the Console panel after the > prompt:
install.packages("tidyverse")
Hit Enter; the download and installation process should start. When it has finished, load the library by executing:
library("tidyverse")
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## -- Attaching packages -------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Above you can see all the libraries that tidyverse loads. Some other libraries that might be useful to install are:
install.packages(c("readxl", "psych", "skimr"))
It is recommended to take care of your folder structure by organising your project with at least 3 folders: one for your scripts, one for your data and another one for results. To avoid problems with paths:
+ Name your script files with .R at the end.
+ Check your current working directory with getwd().
Download the following files
Data extracted from Our World in Data
Locations on a computer are structured as layers, each contained in another. To navigate the folder structure we have to know that:
+ ./ Current location
+ ../ Out of the current location. It can be stacked, e.g. ../../
+ / Root (usually where the important files for the system are located)
+ ~/ Home directory where you can #hygge
You can change your location by giving full directions from the root, or directions relative to the current folder, using setwd("./directions/tofolder/inside")
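The navigation rules above can be tried out safely with a short sketch. It works inside R's temporary directory so nothing in your project is touched; the folder names `my_project` and `scripts` are assumptions made up for the illustration.

```r
# Remember where we started so we can come back at the end
start_dir <- getwd()

# Build a throwaway folder structure inside R's temporary directory
# ("my_project" and "scripts" are hypothetical names)
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "scripts"),
           recursive = TRUE, showWarnings = FALSE)

# Move into the project using a full path
setwd(project_dir)

# Move into a subfolder using a path relative to the current location
setwd("./scripts")
inside_scripts <- basename(getwd())

# ../ steps out of the current location, back to the project folder
setwd("../")
back_in_project <- basename(getwd())

# Return to the original working directory
setwd(start_dir)
```

Relative paths keep a project portable between computers, which is one reason to prefer them over full paths in scripts.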
Data can come in multiple formats. Look at the file extension of your data file, or open it in a text editor to see how it is formatted. Look at the middle column of the first page of the Data Import Cheat Sheet to find the matching import function. Load the data into a variable such as my_data.
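As a minimal sketch of the import step, assuming your file is comma-separated, the example below writes a tiny file to a temporary location and reads it back with read_csv(). The file name and its contents are made up; for your own data, replace the path and pick the read_ function that matches your format.

```r
library(readr)

# Write a small comma-separated file to a temporary location
# (hypothetical file name and toy values, only for this example)
csv_path <- file.path(tempdir(), "example_data.csv")
writeLines(c("sample,group,value",
             "s1,a,1.5",
             "s2,b,2.5",
             "s3,a,3.5"),
           csv_path)

# read_csv() is the readr function for comma-separated files
my_data <- read_csv(csv_path)

my_data
```

Other delimiters have their own functions, for example read_tsv() for tab-separated files and read_delim() for an arbitrary delimiter.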
Tables in base R are stored as a data.frame. Tibbles are an improved version of the data.frame; files imported with the read_ functions are returned as tibbles. Look at the difference by running the commands as.data.frame(my_data) and as_tibble(my_data).
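If you have not imported a file yet, the same comparison can be sketched with the built-in dataset esoph, used here only as a stand-in for my_data.

```r
library(tibble)

# esoph ships with R as a plain data.frame
my_df <- as.data.frame(esoph)
my_tbl <- as_tibble(esoph)

class(my_df)   # a plain data.frame
class(my_tbl)  # a tibble is still a data.frame, with extra classes on top

# Printing the data.frame lists every row; printing the tibble shows
# only the first rows together with the type of each column
print(my_tbl)
```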
When importing tables, the type of data in each column is guessed, but it can also be specified. You can explore your dataset interactively using view() (a new tab opens). Take a glimpse() at the imported dataset and recognise the data type of each column:
| Type | Description | Example |
|---|---|---|
| int | integers | 1, 2, 3, 4 |
| dbl | doubles or real numbers | 1.0, 2.3, 3.623, 4.78 |
| chr | characters or string (text) | “Hello”, “wild-type”, “1” |
| dttm | date-times | “2018-06-09 16:45:40” |
| lgl | logical | TRUE / FALSE |
| fct | factors | 1, 1, 2, 3, 4, 4 Levels: 1, 2, 3, 4 |
| date | dates | “2018-06-09” |
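To illustrate guessing versus specifying column types at import, here is a small sketch using readr's col_types argument. The file and its column names (sample_id, genotype, measurement) are hypothetical and exist only for this example.

```r
library(readr)
library(dplyr)

# A small made-up file; the column names are hypothetical
csv_path <- file.path(tempdir(), "typed_example.csv")
writeLines(c("sample_id,genotype,measurement",
             "1,wild-type,2.5",
             "2,mutant,3.1",
             "3,wild-type,4.0"),
           csv_path)

# Let readr guess the column types
guessed <- read_csv(csv_path)

# Or state the type of each column explicitly with cols()
specified <- read_csv(
  csv_path,
  col_types = cols(
    sample_id = col_integer(),
    genotype = col_factor(),
    measurement = col_double()
  )
)

# Compare the type abbreviations shown next to each column name
glimpse(guessed)
glimpse(specified)
```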
Column types can be reformatted at any time.
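One possible way to reformat columns after import is to overwrite them inside mutate() with base R conversion functions. The table below is made up for the illustration.

```r
library(dplyr)

# Made-up table in which every column was read as text
my_data <- tibble(
  sample_id = c("1", "2", "3"),
  genotype = c("wild-type", "mutant", "wild-type"),
  measured_on = c("2018-06-09", "2018-06-10", "2018-06-11")
)

converted <- my_data %>%
  mutate(
    sample_id = as.integer(sample_id),   # chr -> int
    genotype = as.factor(genotype),      # chr -> factor
    measured_on = as.Date(measured_on)   # chr -> date
  )

glimpse(converted)
```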
Whenever possible, avoid spaces in your column names.
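A short sketch of why spaces are inconvenient and how rename() can remove them. The table and its column names are invented for the example.

```r
library(dplyr)

# A made-up table whose column names contain spaces
my_data <- tibble(`sample id` = 1:3, `body weight` = c(20.1, 22.4, 19.8))

# Names with spaces must be wrapped in backticks every time they are used
heavy <- my_data %>% filter(`body weight` > 20)

# rename(new_name = old_name) swaps them for names that are easier to type
clean_data <- my_data %>%
  rename(sample_id = `sample id`, body_weight = `body weight`)

names(clean_data)
```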
Tables mainly come in two designs: wide format and long format.
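As a small illustration of the wide and long designs, the sketch below builds a made-up wide table and reshapes it with tidyr's gather() and spread(). Newer versions of tidyr also offer pivot_longer() and pivot_wider() for the same purpose; gather() and spread() are used here to match the workshop.

```r
library(tidyr)
library(tibble)

# Wide format: one row per sample, one column per time point (toy values)
wide <- tibble(
  sample = c("a", "b"),
  day1 = c(1.0, 2.0),
  day2 = c(1.5, 2.5)
)

# Long format: one row per observation; the former column names go
# into `day` and their contents into `value`
long <- wide %>%
  gather(key = "day", value = "value", day1, day2)

# spread() reverses the operation and returns the wide layout
wide_again <- long %>%
  spread(key = day, value = value)
```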
In the middle column of page 2 of the Data Import Cheat Sheet you can find how to tidy your data into a suitable format. In short:
+ gather() to go from wide to long format
+ spread() to go from long to wide format
One of the best enhancements in R is the pipe. Pipes chain commands together using %>%, which passes the result of one function as the first argument of the next function. In Windows, a pipe can also be inserted with Ctrl + Shift + M. Example:
myresults <- mydata %>%
  select(column1, column2, 3:10, -column9) %>%
  filter(column1 < 0.05)
There are many things you can do with your dataset. A suggested way of operating would be:
+ Search online for what you want to do, adding tidyverse, dplyr or R at the end of your query
A brief summary of things you can do:
+ select() columns
+ filter() values in columns
+ arrange() your data in ascending order, or arrange(desc()) in descending order
+ mutate() to create new columns or overwrite existing ones
+ pull() a specific column as a vector
+ rename() columns
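The verbs listed above can be chained in a single pipeline. The sketch below uses the built-in dataset esoph; the new column name `total` and the renamed column `age_group` are arbitrary choices for the illustration.

```r
library(dplyr)

# esoph is a built-in dataset; convert it to a tibble for nicer printing
esoph_tbl <- as_tibble(esoph)

result <- esoph_tbl %>%
  select(agegp, alcgp, ncases, ncontrols) %>%  # keep four columns
  filter(ncases > 0) %>%                       # keep rows with at least one case
  mutate(total = ncases + ncontrols) %>%       # create a new column
  rename(age_group = agegp) %>%                # rename(new_name = old_name)
  arrange(desc(ncases))                        # sort in descending order

# pull() extracts one column as a plain vector
cases <- result %>% pull(ncases)
```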
Provided that your data is in long format, you can group your observations based on a specific column using group_by(column_name). This allows you to perform operations and run functions per group instead of on the whole dataset.
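A minimal sketch of a grouped operation, again on the built-in dataset esoph: each summary is computed once per age group rather than once for the whole table. The names of the summary columns are arbitrary.

```r
library(dplyr)

per_age_group <- esoph %>%
  group_by(agegp) %>%                 # one group per age category
  summarise(
    n_rows = n(),                     # number of rows in the group
    total_cases = sum(ncases),        # sum within the group
    mean_cases = mean(ncases)         # mean within the group
  )

per_age_group
```

The result has one row per group, so it is much shorter than the original table.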
Combining tables is inspired by SQL.
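The workshop text does not name specific functions here, so the following sketch assumes dplyr's join family (left_join(), inner_join(), full_join()), which mirrors the corresponding SQL joins. Both tables are made up for the illustration.

```r
library(dplyr)

# Two made-up tables that share the key column sample_id
samples <- tibble(sample_id = c(1, 2, 3),
                  genotype = c("wild-type", "mutant", "mutant"))
results <- tibble(sample_id = c(1, 2, 4),
                  measurement = c(0.5, 0.8, 0.3))

# left_join(): keep every row of `samples`; unmatched rows get NA
left <- left_join(samples, results, by = "sample_id")

# inner_join(): keep only the rows present in both tables
inner <- inner_join(samples, results, by = "sample_id")

# full_join(): keep all rows from both tables
full <- full_join(samples, results, by = "sample_id")
```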
Custom summary reports can be created using summarise(). Although flexible, this requires specifying in detail the summaries we want to see, such as mean, median, maximum values, etc. Packages like skimr or psych provide a set of out-of-the-box summary statistics for your data. Examples based on the built-in dataset esoph:
library(skimr)
##
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
##
## filter
esoph %>% skim() %>% print()
## Skim summary statistics
## n obs: 88
## n variables: 5
##
## -- Variable type:factor -------------------------------------------------------------------------------------------------------------
## variable missing complete n n_unique top_counts
## agegp 0 88 88 6 45-: 16, 55-: 16, 25-: 15, 35-: 15
## alcgp 0 88 88 4 0-3: 23, 40-: 23, 80-: 21, 120: 21
## tobgp 0 88 88 4 0-9: 24, 10-: 24, 20-: 20, 30+: 20
## ordered
## TRUE
## TRUE
## TRUE
##
## -- Variable type:numeric ------------------------------------------------------------------------------------------------------------
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ncases 0 88 88 2.27 2.75 0 0 1 4 17 ▇▂▂▁▁▁▁▁
## ncontrols 0 88 88 11.08 12.72 1 3 6 14 60 ▇▂▁▁▁▁▁▁
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
esoph %>% describe()
## # A tibble: 5 x 13
## vars n mean sd median trimmed mad min max range skew
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 88 3.39 1.65 3 3.36 1.48 1 6 5 0.0465
## 2 2 88 2.45 1.12 2 2.44 1.48 1 4 3 0.0640
## 3 3 88 2.41 1.12 2 2.39 1.48 1 4 3 0.128
## 4 4 88 2.27 2.75 1 1.85 1.48 0 17 17 2.20
## 5 5 88 11.1 12.7 6 8.49 5.93 1 60 59 1.89
## # ... with 2 more variables: kurtosis <dbl>, se <dbl>
Finally